The Annals of Statistics 
2013, Vol. 41, No. 1, 91-124 
DOI: 10.1214/12-AOS1075 

© Institute of Mathematical Statistics. 2013 

CONVERGENCE RATE OF MARKOV CHAIN METHODS FOR 
GENOMIC MOTIF DISCOVERY 1 

By Dawn B. Woodard and Jeffrey S. Rosenthal 
Cornell University and University of Toronto 

We analyze the convergence rate of a simplified version of a pop- 
ular Gibbs sampling method used for statistical discovery of gene 
regulatory binding motifs in DNA sequences. This sampler satisfies 
a very strong form of ergodicity (uniform). However, we show that, 
due to multimodality of the posterior distribution, the rate of conver- 
gence often decreases exponentially as a function of the length of the 
DNA sequence. Specifically, we show that this occurs whenever there 
is more than one true repeating pattern in the data. In practice there 
are typically multiple such patterns in biological data, the goal being 
to detect the most well-conserved and frequently-occurring of these. 
Our findings match empirical results, in which the motif-discovery 
Gibbs sampler has exhibited such poor convergence that it is used 
only for finding modes of the posterior distribution (candidate mo- 
tifs) rather than for obtaining samples from that distribution. Ours 
are some of the first meaningful bounds on the convergence rate of 
a Markov chain method for sampling from a multimodal posterior 
distribution, as a function of statistical quantities like the number of 
observations. 

1. Introduction. Gene regulatory binding motifs are short DNA sequences 
that control gene expression. The identification of these regulatory motifs 
poses several challenges: they are only 6-15 base pairs in length, and do not 
contain clear start and stop codons; a regulatory motif is indistinguishable 
from random sequences of the same length except that it is a particular 
sequence that occurs more frequently than expected under the background 
model. Discovery of previously undescribed regulatory motifs in DNA se- 
quences thus involves both finding such a repeating pattern ("motif") and 
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Fig. 1. Illustration of motif discovery: finding an unknown repeating pattern in a long 
DNA sequence. The pattern can vary slightly between instances. 

determining where that pattern occurs in the sequences [Kellis et al. (2004)]; 
this is illustrated in Figure 1. 

One of the most effective methods for identifying new regulatory motifs is 
based on a statistical model and associated Gibbs sampling computational 
method [Liu, Neuwald and Lawrence (1995)]. This approach has been pop- 
ularized with the availability of software programs for its use, such as Bio- 
Prospector [Liu, Brutlag and Liu (2001)] and AlignAce [Roth et al. (1998)]. 

Like most other methods for identifying regulatory motifs, the Gibbs sam- 
pling method often yields different answers when starting from different ini- 
tial configurations. The method is applied by rerunning the Gibbs sampler 
many times, using randomly generated initial positions. The resulting candi- 
date motifs are sorted according to some goodness-of-fit measure, and then 
the highest-scoring motifs are reported [Lawrence et al. (1993), Liu, Brutlag 
and Liu (2001), Jensen et al. (2004)]. This fact contrasts with the theoretical 
properties and traditional use of a Gibbs sampler, namely to be simulated 
until it has some claim of having converged to the posterior distribution, at 
which point the answer should be the same regardless of initialization. 

We address a particular model and Gibbs sampler that are representative 
of this family of methods. We analyze the convergence rate of a simplified 
version of the Gibbs sampler and show that, due to multimodality of the 
posterior distribution, the convergence rate typically decreases exponentially 
as a function of the DNA sequence length (Theorem 3.2). Specifically this 
occurs when there is more than one true repeating pattern in the data, 
meaning that the DNA is made up of short subsequences, each of which is 
either equal to one of several motifs or is generated from the background 
model. In practice there are typically multiple distinct repeating patterns 
in biological data, corresponding to multiple gene regulatory binding mo- 
tifs or to repeating patterns that have other biological significance, such as 
"determinants of mRNA stability or even sites for regulation by antisense 
transcripts" [Roth et al. (1998)]. The goal is to detect the most frequently- 
occurring and well-conserved motif or motifs [Neuwald, Liu and Lawrence 
(1995)]. So in practice we can expect the sampler convergence rate to decay 
exponentially; this is equivalent to the run time of the algorithm growing ex- 
ponentially in the sequence length, for a fixed accuracy. The multimodality 
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Fig. 2. The posterior density estimates of 0i, i (A) from two different Gibbs sampling 
chains, in the case of two true motifs. 

of the posterior and resulting poor convergence are illustrated in Figure 2, 
which shows posterior density estimates of a particular function of the pa- 
rameter vector, from two different Gibbs sampling chains. Initialized with 
distinct parameter values, the two chains have become trapped in different 
modes of the posterior density and thus have not yet individually converged 
to the posterior distribution. 

The multimodality of the posterior distribution arises due to a contra- 
diction between the data, which typically have multiple true repeating pat- 
terns, and the model assumption of a single such pattern. Practitioners use 
the model not because it is believed to precisely capture the true process 
that generated the data (which is extremely complex) but because it cap- 
tures several important features of that process [Neuwald, Liu and Lawrence 
(1995), Roth et al. (1998)]. Our results show that the presence of multiple 
motifs, even if some occur very infrequently, causes slow convergence. Recog- 
nizing that there can be multiple true motifs, a variant on the Gibbs sampler 
has been proposed that allows for a fixed number of motifs greater than one 
[Neuwald, Liu and Lawrence (1995)]. This approach is only likely to fix the 
slow convergence if the number of motifs in the model is at least as large 
as the number of true motifs in the data. This is only a practical solution if 
the number of true motifs is small. 

Our simplification of the model and associated Gibbs sampler assumes 
that motifs can only end at locations in the sequence that are divisible by 
the motif length, instead of at arbitrary locations (Section 2.2). This is done 
to facilitate analysis, by avoiding the "phase shift" issue that occurs in the 
original sampler [Lawrence et al. (1993), Liu (1994)]. Since phase shift slows 
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convergence of the chain, it is likely (but unproven) that our results on slow 
convergence of the simplified chain also hold for the original chain. 

We also give evidence supporting the conjecture that the convergence rate 
decreases polynomially if there is no more than one true (and identifiable) 
motif in the data. We give empirical evidence, and prove polynomial decay 
of the convergence rate for the case of length-one motifs. In this case any 
true motifs are nonidentifiable; see Theorem 3.3. 

Ours are some of the few meaningful bounds on the convergence rate of 
a Markov chain method used in Bayesian statistics, as a function of sta- 
tistical quantities such as the number of observations or number of groups. 
Such results are particularly rare for multimodal posterior densities. Roberts 
and Sahu (2001) show that the convergence rate of a Gibbs sampler for a 
unimodal posterior density in M. d approaches a constant as the number of ob- 
servations increases. Belloni and Chernozhukov (2009) show that if the pos- 
terior density converges uniformly to a normal density, then a Metropolis- 
Hastings chain restricted to a neighborhood of the true parameter value 
has polynomially decaying convergence rate. Jones and Hobert (2001, 2004) 
and other authors [e.g., Rosenthal (1995, 1996)] obtain bounds on the time 
to be within distance e > of convergence for various hierarchical random 
effect models having unimodal posterior densities, as a function of the ini- 
tial values, data and hyperparameters. Mossel and Vigoda (2006) show that 
the convergence rate of a Markov chain method used in Bayesian phyloge- 
netics can decrease exponentially in the number of samples in the dataset. 
We also learned after completing this article that Dr. Scott Schmidler at 
Duke University has independently obtained some convergence results in 
the motif-discovery context (personal communication). 

Showing that a Markov chain method used in statistical practice is "well- 
behaved" usually consists of proving geometric ergodicity [Liu, Wong and 
Kong (1995), Jarner and Hansen (2000), Fort et al. (2003), Johnson and 
Jones (2010)], meaning that the chain converges to the posterior distribution 
at a geometric rate. The Gibbs sampler we analyze satisfies the even stronger 
property of uniform ergodicity; despite this, it is so poorly-behaved as to be 
unusable for obtaining samples from the posterior distribution for long DNA 
sequences. 

Characterizing the dependence of the convergence rate on statistical quan- 
tities like the number of observations or the number of parameters is critical 
in justifying the use of a Markov chain method. However, there are several 
difficulties in doing so. First, the posterior distribution of a statistical model 
has a much more complex form than the stylized distributions for which 
Markov chain convergence rates are typically obtained [Borgs et al. (1999), 
Bhatnagar and Randall (2004), Woodard, Schmidler and Huber (2009b)]. 
Second, the data, and thus the convergence rate of the Markov chain, are 
stochastic and depend on the data-generating mechanism. 
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We address these challenges by utilizing Bayesian asymptotic theory, 
which characterizes the behavior of the posterior distribution as the num- 
ber of observations grows. This is complicated by the fact that Bayesian 
asymptotic theory is most well developed in the case of a continuous param- 
eter space, but the motif Gibbs sampler is defined on a discrete parameter 
space. We handle this by applying the asymptotic results on an alternative 
continuous parameterization of the motif model and then mapping those 
results to the discrete parameterization. Due to these technical challenges 
our main theorem requires sufficiently long motifs, and is restricted to the 
case where each true motif corresponds to a fixed sequence of nucleotides 
(disallowing the small variations seen in Figure 1). We give an additional 
argument and simulation results suggesting that slow mixing holds even for 
very short motifs, and when the true motifs are allowed to vary between 
instances. 

The motif discovery example provides insights into the dynamics of stan- 
dard Markov chain methods applied to statistical models with highly mul- 
timodal posterior distributions. Other examples that may have the same 
exponential-time property include Markov chains for model search in the 
context of regression with a large number of predictors [Liang and Wong 
(2000), Hans, Dobra and West (2007)] and Markov chains for spatial mix- 
ture models based on random fields [Geman and Geman (1984), Green and 
Richardson (2002)]. Our example also provides a test case for the use of 
more sophisticated Markov chain methods that are designed to handle mul- 
timodality [Del Moral, Doucet and Jasra (2006), Andrieu, Doucet and Holen- 
stein (2010)]: if a method can be shown to sample from the posterior dis- 
tribution of the motif-discovery model in polynomial time, then it would be 
dramatically more efficient than the Gibbs sampling approach. 

Background on the Gibbs sampling method for motif discovery and on 
Markov chain convergence rates is in Section 2. Our convergence results are 
in Section 3, and a simulation study is given in Section 4. The proof of our 
main result is in Section 5, and we draw conclusions in Section 6. 

2. Background. 

2.1. Statistical motif discovery. The goal of motif discovery is to find 
short sub-sequences of nucleotides (length 6-15 base pairs) that occur multi- 
ple times (more often than could be explained under the background model) 
in one or more long DNA sequences. Neither the nucleotide pattern nor the 
sub-sequence locations are known. This goal is illustrated in Figure 1. 

We address one of the two main variants of Gibbs sampler used in mo- 
tif discovery. The variant we analyze takes the number of motif instances 
per sequence to be unknown, while the other variant fixes the number of 
instances per sequence [Jensen et al. (2004)]; the two approaches are closely 
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related and should have similar properties. Programs such as BioProspector 
are based on the method we analyze, and build in a number of additional 
features, such as a prior distribution on the motif frequency and handling of 
gapped motifs; however, by adding parameters and complexity to the model 
these enhancements probably make the Gibbs sampler slower to converge, 
and so are unlikely to affect our slow-mixing results. 

We focus further on the case of a single DNA sequence (having an un- 
known number of motif instances). The case of multiple sequences can be 
addressed by concatenating to obtain a single sequence. 

The motif instances are not necessarily identical. Taking the length w 
of the motif to be known, one can describe the nucleotide pattern by a 
position-specific frequency matrix, which contains the probability of oc- 
currence of each nucleotide at each position in the motif. Call this matrix 
9\:w — (0i> ■ ■ • ) &w)i where 6^ is the unknown probability vector for the feth 
position. Let the nucleotides be labeled 1, . . . ,M, so that 9^ has length M; 
for DNA data M = 4. For each instance of the motif, the nucleotide in po- 
sition k is assumed to be drawn independently from a discrete distribution 
with parameters Ok- The positions in the full sequence that are not part of 
a motif instance are assumed to have nucleotide drawn independently from 
a discrete distribution with unknown probability vector 6q. 

Let S = (Si, . . . , S L ) G {1, . . . , M} L be the observed sequence, having 
length L. In the original model of, for example, Liu, Neuwald and Lawrence 
(1995), a motif is allowed to start at any index i £ {1, . . . ,L — w + 1}, but 
we will analyze a simplified version that only allows a motif to start at in- 
dices wi — w + 1 for % G {1, . . . ,L/w} where L is divisible by w. This choice 
is explained in Section 2.2. Let A{ G {0, 1} be the (unknown) indicator of 
whether a motif begins at index wi — w + 1, for i G {1, . . . ,L/w}, and de- 
fine A = (Ai, . . . ,A L / W ). Let N(A( fc )) be the length-M vector of counts of 
the occurrence of each nucleotide at position k G {1, . . . ,w} of all motif in- 
stances, conditional on A. Similarly, N(A C ) is defined to be the length-M 
vector of counts for each nucleotide in the background locations, that is, the 
locations that are not part of any motif instance, 

L/w 

N(A^) m 4 £ l {Ai= i, Swi _ w+k=mh k G {1, . . . , «,}, m € {1, . . . , M}, 

t=i 

w 

(2.1) A(A c ) m 4iV(S) m -^Ar(A( fc )) m , 

k=l 

L 

N(S) m ^J2 1 iS i =m}- 

i=l 
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For any two equal-length vectors /3 = (/?i, . . . , (3k) and n = (ni, . . . , uk) 
define the notation 



K 



A 



(2.2) /3 n = n/f fc , r(n) = l]rw, 

k=l k=l 



K 

|n| = y^n fc , 



fc=i 



where T is the gamma function. Using this notation, the likelihood condi- 
tional on A can be written as 



L/w 



7rfS|A,0 



0:w 



>=n 



(2.3) 



1 = 1 



IK* 



wi — w-\-k 



k=l 



4^=1} 



k=l 



'0,5 



wi — w-\-k 



k=l 



N(A( fe )) 



where 9q :w = (Oq, 0\, . . . , W ) denotes all model parameters. We will use n 
to indicate the likelihood, prior or the full, marginal or conditional posterior 
distributions as distinguished by its arguments. 

The prior distributions for the unknown quantities are 1^(0^) = Dirichlet(/3 fe ) 
for k 6 {0, . . . , w} and ~n{Ai = 1) = po independently. Here po € (0, 1) is a 
known constant and (3 , . . . ,[3 W are fixed length-M vectors with (3k,m > 0. 
The corresponding posterior distribution is [Jensen et al. (2004)] 



(2.4) 



7T(A, O :iu|S) (X 7T(A, O:w , S) 



■Po 



A| (1 _ po) V-|A| x e ^) + ^l x f[^ AWy+P "-\ 



fc=l 



Liu, Neuwald and Lawrence (1995) integrate out the parameters 9q :w in 
the above formula to yield a posterior distribution on A. Using the notation 
from (2.2), 



(2.5) 



vr(AIS) 



ocp A| (l-Po) L/ 



L/w _ lAl r(N(A c )+/3 c 



r(|N(A-)| + |/3 | 



k=l 



r(N(AW)+ft) 
|N(A(*))| + |/3 fc |)- 



Liu (1994) gives theoretical results supporting faster convergence of a Gibbs 
sampler for the reduced posterior 7r(A|S) relative to a Gibbs sampler for 
7r(A, #o:u>|S). So Liu, Neuwald and Lawrence (1995) propose to use a Gibbs 
sampler to draw from 7r(A|S), having state vector A £ X = {0, Y\ L I W . This 
sampler iteratively updates each A{ according to its conditional posterior 
distribution; details are given in Section 5.1. 

Although this Gibbs sampler has both systematic-scan and random-scan 
versions, we expect that the mixing properties (defined in Section 2.3) of 



8 



D. B. WOODARD AND J. S. ROSENTHAL 



the two versions are identical. For this reason we focus attention on the 
random-scan Gibbs sampler, which is easier to analyze. We also make the 
transition matrix of the Markov chain nonnegative definite by including a 
holding probability of 1/2 at every state; this is a common technique when 
analyzing the mixing properties of Markov chains [Madras and Zheng (2003), 
Woodard, Schmidler and Huber (2009a)]. It only increases the mixing time 
(Section 2.3) by a factor of two, so it does not affect results on the order of 
the run time as a function of L. Let Ar^i indicate the vector A excluding 
the ith element, and 7r(^LJA[_j], S) oc7r(A|S) indicate the conditional pos- 
terior distribution of Ai given Ar_£i. With these definitions we can write the 
transition matrix T of the Gibbs sampler as follows for any A, A' G X: 

nA,A0--^x;i {A ^- A| _, |} 

i=i 

(2-6) x \[1 {K =a i} + <Ai = 0[A H] , S)1 K=0} 

+ vr(^ = l|A H] ,S)l { ^ =1} ]. 

Expressions for 7r(^4j|A[_j], S) are given in Section 5.1. 

We also need an expression for the likelihood marginalized over A. Using 
the vector notation S n:m = (S n , . . . , S m ) for n <m, 

L/w 

(2.7) it(S\6 0:w ) = Y\ f (Swi-w+i:wi\Oo:w) where 

w w 

(2.8) f(s\e .. w )±p l[9 k , Sk + (l- P o)l[0 , Sk , se{l,...,Mf. 

k=l k=l 

Under model (2.7) and (2.8) each subsequence S w i- w+ i- w i is either gen- 
erated from the motif 6± :w with probability po, or generated from the back- 
ground 6q with probability (1 —po)- So we have i.i.d. observations Sm-w+i-.wi 
for i G {1, . . . ,L/w}, which will allow us to use Bayesian asymptotic theory 
for i.i.d. parametric models. 

2.2. Reason for the simplification. As stated in Section 2.1, while the 
original model allows a motif to start at any index i E {1, . . . ,L — w + 1}, 
we analyze a simplification of the model that assumes that motifs can only 
start at indices wi — w + 1 for i G {1, . .. ,L/w}. This simplification is done 
to facilitate analysis; however, we believe that our results are likely to hold 
for the original model as well. 

First, our rapid mixing result (Theorem 3.3) immediately holds for the 
original model and associated Gibbs sampler. This is because it is for the 
case of w = 1, where the original and simplified models are identical. 
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Second, the proof of our slow mixing result (Theorem 3.2) can be extended 
to the case where the model allows motifs to start at indices that are fixed 
distance > w apart. However, Theorem 3.2 does not easily extend to the case 
where motifs can start in locations that are less than w distance apart (in- 
cluding the original model). This is due to the following "phase shift" issue, 
which complicates analysis. For illustration consider the case where M = 4 
(there are four possible nucleotides) and w = 5 (motifs are five nucleotides 
long), and the true motif is (deterministically) the sequence (1,4,2,2,3). 
Phase shift means that it is possible to estimate that a motif begins or ends 
in the middle of one of the (1, 4, 2, 2, 3) subsequences that exist in the data. 
For example, if the DNA sequence S satisfies S22:26 — (1> 4, 2,2,3), corre- 
sponding to a true motif beginning at position 22, then the original model 
also allows for the possibility that A23 = 1, meaning that a motif could be 
estimated to instead start at position 23 with the sequence 4, 2, 2, 3. 

While phase shift complicates analysis of the original Gibbs sampler, it 
should also make the original Gibbs sampler converge more slowly than the 
simplified Gibbs sampler. The effect of phase shift on convergence of the 
original Gibbs sampler is that it can become trapped in a local mode of 
the posterior distribution that corresponds to a shifted version of the true 
motif. This effect is described in Lawrence et al. (1993) and Liu (1994). 
To illustrate, take the above example where the true motif is (1,4,2,2,3). 
There is a local mode of the posterior distribution for which the inferred 
motif starts with the sequence 4, 2, 2, 3, another for which the inferred motif 
ends with the sequence 1,4,2,2, and so on. This posterior multimodality 
slows convergence of the original Gibbs sampler. It also suggests that our 
slow mixing result (Theorem 3.2) for the simplified Gibbs sampler holds for 
the original sampler. 

Analysis of the original sampler should be possible using the same general 
approach taken here, but a number of the technical details would need to 
change. We leave this to future work. 

2.3. Markov chain convergence rates. Consider a Markov chain with 
transition matrix T and stationary distribution tt on a discrete state space 
X. For i£df and D C X, let T(x,D) = ^2 y( zr)T(x,y). If the chain is ini- 
tialized at x G X, then the total variation distance to stationarity after n 
iterations is 

\\T n (x, •) - 7r(-)|| TV ± max|T"(x, D) - «{D)\. 

The mixing time of the chain is the number of iterations required to be 
within distance e £ (0, 1) of stationarity, 

t £ = maxmin{n : ||T m (x, •) — 7t(-)Htv — e ^ or an m — n }' 
cf. Sinclair (1992). 
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Consider T irreducible, aperiodic, reversible and nonnegative definite, 
which holds for the random-scan Gibbs sampler in Section 2.1. Then r e 
is finite and closely related to the spectral gap Gap(T) = 1 — A2, where 
A2 G [0,1) is the second-largest eigenvalue of T. Since the state space X is 
finite, Gap(T) > and the chain is called uniformly ergodic [Roberts and 
Rosenthal (2004)]. The quantities t £ and Gap(T) are related via [Sinclair 
(1992)] 

r^GapCzW-ln 

(2.9) 

r £ > -(1 - Gap(T)) Gap(T)- i (-ln(2 £ )). 

The efficiency of the Markov chain can be measured by how quickly t £ 
increases as a function of the problem difficulty, for instance the dimension 
of the parameter space. In our case we are interested in the dependence of 
t £ on the length L of the DNA sequence, since in practice one analyzes very 
long sequences. We would certainly hope that t £ grows at most polynomi- 
ally in L for any fixed e; this property is called rapid mixing. Slow mixing 
means that t £ increases exponentially for some s. By (2.9) rapid mixing 
is equivalent to Gap(T) decreasing at most polynomially toward zero, and 
slow mixing is equivalent to Gap(T) decreasing exponentially toward zero, 
if — logfmin^g^ vr(x)] increases polynomially in L. The latter property holds 
for the random-scan Gibbs sampler in Section 2.1. The rapid/slow mixing 
distinction is a measure of the computational tractability of an algorithm; 
polynomial factors are expected to be eventually dominated by increases 
in computing power due to Moore's Law, while exponential factors cause a 
persistent computational problem. 

3. Convergence results. We consider the mixing time (equivalently, spec- 
tral gap) of the Gibbs sampler when the data are drawn from a generalization 
of the model given in Section 2.1 that allows multiple true motifs. First we 
give negative results for the case of multiple true motifs, and then we give 
a positive result for with no true motifs. 

3.1. Slow mixing for multiple true motifs. In this section, we show that 
if the data actually contain multiple true motifs, then the Gibbs sampler is 
slowly mixing: that Gap(T) is (D(a~ L ) for a > 1 where — log[min xe ^ vr(x)] 
is 0{L q ) for some q > 0. To make this statement precise in the presence of 
random data, we need to make assumptions about the model by which the 
data are generated. Our convergence results are obtained using Bayesian 
asymptotics based on this generative model; for other connections between 
Markov chain convergence and Bayesian asymptotics, see Kamatani (2011) 
and Belloni and Chernozhukov (2009). 

For a concrete example to keep in mind, consider the case where M = 4 
(there are four possible nucleotides) and w = 5 (motifs are five nucleotides 



min7r(:r) 

-XGX 



lne , 
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long). Then let the DNA sequence S be generated as the concatenation of 
many length-five subsequences, each of which is either (motif one) equal 
to (1,4,2,2,3) with probability 0.005, or (motif two) equal to (4,2,4,1,3) 
with probability 0.001, or generated as i.i.d. noise where each nucleotide 
has equal probability. Theorem 3.2 below says that the Gibbs sampler T is 
slowly mixing for data generated in this way. 

When analyzing the Gibbs sampler we do not assume that the data S 
are generated according to the inference model (2.7) and (2.8). Our most 
general result only assumes that the subsequences Swi-w+i-.wi are i.i.d. 

Assumption 3.1. The data subsequences Sm-<w+i:wi indexed by i 6 
{1, . . . , L/w} are independent and identically distributed according to some 
probability mass function g(s) > : s € {1, . . . , M} w , that is, 

L/w 

S ~ J^J g(S u ,i— w+l:wi) ■ 
i=l 

Under Assumption 3.1, we give a simple sufficient condition for slow mix- 
ing that relates the generative model g(s) to the inference model f(s\0Q- w ) 
via the quantity Elog f(s\Oo :w ) = ^ s g(s) log /(s|^o:u;)- Since 9^ for k € 
{0,... ,w} is defined on the simplex ^ = {0 k : J2m=i ®k,m = 1, 0k,m > 0}, the 
quantity Elog f(s\Oo :w ) S [— oo,0) is a function of 0q- w € \E rU,+1 . It is contin- 
uous in 6q :w , because it is a linear combination of a finite number of the con- 
tinuous functions log f{s\Oo-.w)- We call t](6q :w ) = £4og /(s|0o :U) ) multimodal 

if there exist fl&^X™ G $> w+1 and bounded sets F 1 3 0^ and F 2 3 0q2> 
such that 

(3.1) F l nF 2 = and sup 7/(fl :«,) < vVoL), J€{1,2], 

where dFj = cl(Fj) fl cl(Fj) is the boundary of Fj. Equation (3.1) implies 

(i) 

that 6q. w is in the interior of Fj. For a continuous function on a closed, con- 
nected subset of M. d (like ^ w+l ) this definition of multimodality is weaker 
than the existence of multiple strict local maxima and stronger than the 
existence of multiple local maxima. We call a function h of 0q :w a multimin- 
imum function if —h(Oo :w ) is multimodal. 

Theorem 3.1. Under Assumption 3.1, if the function Elog /(s|0o: TO ) of 
9q :w is multimodal then the spectral gap of the Gibbs sampler T decreases 
exponentially in L, almost surely. 

Theorem 3.1 will be proven in Section 5. It uses asymptotic results on 
the behavior of the posterior when the data are not distributed according 
to the inference model [Berk (1966)]. We will see that when £ , log/(s|0o :lu ) 
is multimodal the posterior distribution is also multimodal for large L, and 
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that the heights of the modes relative to the heights of the valleys in between 
grow exponentially in L, causing the slow mixing. This is due to the fact 
that, using (2.7), the log-likelihood is 

L/w 

log7r(S|0 O:u ,) = ]Tl°g/(S 

i=l 

which satisfies the following for any 6q- w : 

L/w 

-——y^logf(S wi - w+ i lwi \Oo: W ) ^ Elogf(s\0 , w ) a.s. 
L/w ^-^ 

i=l 

by the strong law of large numbers. This effect leads to the likelihood func- 
tion being multimodal for large L. Statistically, these correspond to multiple 
values of 0q :w that explain the data well. 

Another way of stating Theorem 3.1 is via the Kullback-Leibler diver- 
gence between f(s\Oo :w ) and g(s). The divergence measures the degree of 
difference between f(s\6o- w ) and g(s) and is defined as 

V 5 s log— — 7 

^ f{s\V0:w) 

(3.2) 

= E 9 & lo §5 , ( s ) - E 9 ^ log f( s \°o--w)- 

s s 

Since ^ s g(s) log g(s) does not depend on Oo-_ w , the divergence is a multi- 
minimum function iff E\ogf(s\Oo :w ) is multimodal. 

Corollary 3.1. Under Assumption 3.1, if the Kullback-Leibler diver- 
gence (3.2) is a multiminimum function of 0q- w then the spectral gap of the 
Gibbs sampler T decreases exponentially in L, almost surely. 

Next we show that multimodality of Elogf(s\0Q :w ) occurs when the gen- 
erative model g(s) includes J > 1 true motifs, described by position-specific 
frequency matrices 0-[* w for j € {1, . . . , J}. Assumption 3.2 says that g(s) is 
obtained by extending the inference model (2.8) to the case of J > 1 motifs. 

Assumption 3.2. The p.m.f. g(s) from Assumption 3.1 satisfies 

J w I J \ w 

(3.3) ga . (s) =2>II C + 1 " Eft" n °ls k , 

j=l k=l \ j=l I k=l 

where J > 1 and: 

(1) pj > for j G {1, ... , J} are the motif frequencies, where ^/=iPj < 1> 

(2) Oq is a background probability vector with Yl m =i m = 1 an d ^om > "' 

(3) 0\* w for j € {1, . . . , J} are position-specific frequency matrices. 
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Due to the complex form of E log f(s\6o :w ) = Y^ s do* ( s ) log f(s\6o :w ) under 
Assumptions 3.1 and 3.2 it is difficult to characterize the number of modes 
without making any additional assumption. We will restrict our analysis to 
the case where for each j S {1, . . . , J} and k 6 {1, . . . , w} there is some m € 
{1, . . . , M} with 9 :> k * m = 1. This means that each true motif is a fixed length- w 
sequence of nucleotides, for example, where w = 5 and M = 4 and the first 
and second motifs correspond to the deterministic sequences (1,4,2,2,3) 
and (4,2,4,1,3), respectively. The case without this restriction is discussed 
below. 

Assumption 3.3. For each j e {1, . .. , J} and each k G {1, . . . ,w}, there 
is some t J k € {1, . . . , M} for which Q 3 * ^ = 1. Also, 

k,t k 
1 - 

(3.4) a = minliminf — > 1, ,■ , > 0. 

j+j' w^oo w ^ K&i } 
k=l 

Assumption 3.3 says that the motifs are deterministic in the above sense, 
and that for any two motifs j ^ j' the proportion of differences between the 
motif sequences does not decay to zero as w grows. This ensures that the 
motifs are different enough from one another for large w to cause a mixing 
problem in the Markov chain. With these assumptions, we have multimodal- 
ity of -Elog/(s|0o:iu) f° r large enough w. 

Lemma 3.1. Under Assumptions 3.1-3.3 there exists w* < oo that de- 
pends on po, Oq, J, {pj}j = i, and a [as in equation (3.4)] such that the 
following holds. If w >w*, then Elog f(s\6o :w ) is multimodal with at least 
J > 1 local maxima. 

Lemma 3.1 is proven in the supplementary material [Woodard and Rosen- 
thal (2013)]. Combining Theorem 3.1 and Lemma 3.1 immediately yields our 
main result: slow mixing for the case of multiple true motifs. 

Theorem 3.2. Under Assumptions 3.1-3.3, there exists w* < oo such 
that whenever w > w* the spectral gap of the Gibbs sampler T decreases 
exponentially in L, almost surely. 

While Theorem 3.2 is stated for w large enough and assumes deterministic 
true motifs, the simulation results in Section 4 suggest that slow mixing 
occurs even for nondeterministic true motifs and for w as low as six. 

Theorem 3.2 says that the presence of multiple motifs in the generative 
model contradicts the inference model assumption of a single motif, causing 
slow mixing. In realistic biological situations there are frequently multiple 
motifs, corresponding to multiple gene regulatory binding motifs or to re- 
peating patterns that have other biological significance [Neuwald, Liu and 
Lawrence (1995), Roth et al. (1998)]. Theorem 3.2 says that these patterns 
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do not have to occur often in order to cause slow mixing (that slow mix- 
ing occurs even when some of the pj are very small). So when L is large 
the Gibbs sampler should be used only as a tool for generating candidate 
motifs, and the results cannot be interpreted as obtaining samples from the 
posterior distribution, or used for Monte Carlo estimation. 

Theorem 3.2 assumes that each true motif is a deterministic sequence; 
now consider the case of variable true motifs, that is, where Assumptions 3.1 
and 3.2 hold but not Assumption 3.3. We give an informal argument sug- 
gesting that the function Elogf(s\Oo :w ) is still multimodal. 

Consider the case where X^/=iPj = Po- Then, using (2.8) and (3.3), 

w w 
. fc=l k=l 

So Elog f(s\0Q ]W ) can be written as 

J 

(3.5) Yl 90' ( s ) lo S = J2 - E /( s l ( o> CJ) f(s\0o-. w ). 

s j=l P0 s 

By standard information-theoretic results [Kullback (1959), Berk (1966)], 
for each j the function ^ s /(s|(0q 5 0{* w )) l°g f( s \^o-.w) has a unique global 
maximum at 6q ]W = (Oo,0-[* w ). Since (3.5) is the weighted sum of continuous 
functions that have global maxima occurring at distinct locations (6q, #{*„,), 
it seems likely that (3.5) is multimodal when J > 1. 

3.2. Rapid mixing for < 1 true motif. Simulations suggest (Section 4) 
that when there is no more than one true motif the Gibbs sampler is rapidly 
mixing, that is, Gap(T)- 1 = 0(L q ) for some q > 0. We have one theoretical 
result in this direction, showing rapid mixing for the case w = 1. In this case 
any true motif is indistinguishable from the background signal, so there are 
effectively zero true motifs. 

Theorem 3.3. If w = 1, then the Gibbs sampler T has spectral gap that 
decreases polynomially in L, uniformly over S € {1, . . . , M} L . Specifically, 
for M = 2, 

sup Gap(T)^ 1 = 0(L U ) 
Se{i,...,M} L 

and for fixed M > 2 the same result holds for a larger-degree polynomial. 

Theorem 3.3 (proven in the supplementary material [Woodard and Rosen- 
thal (2013)]) shows rapid mixing in the worst case over possible datasets S. 
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Contrast with Theorems 3.1 and 3.2, which show slow mixing almost surely 
under a particular generative model g(s). 

It is likely that the spectral gap bound given in Theorem 3.3 is very 
loose as a function of L, since the tools that we use to obtain it (Theo- 
rem B.4 in particular) can be imprecise. However, obtaining a tighter bound 
would require substantially longer arguments, so we leave this to future 
work. Additionally, one could assume a particular distribution for S and use 
an average-case analysis to obtain a tighter bound, but also one that would 
have narrower interpretation. 

4. Simulation study. We simulate data with either J = 1 or J = 2 true 
motifs, and measure the convergence of the Gibbs sampler. The data are 
simulated as follows; to emulate DNA data we take M = 4. The true position- 
specific frequency matrix 9^. w for each motif j is obtained by drawing its 

columns 0j? independently for k € {1, . . . ,w} from a Dirichlet distribution 
with parameter vector a± (chosen as described below). The background 
frequency vector Oq is drawn from a Dirichlet distribution with parameter 
vector cxq. We also define the motif frequency to be pj = 0.005 for each motif 
j (a typical value in practice). With these definitions, the data vector S is 
obtained by drawing each subsequence S w i- w+ i-, w i for i € {1, ... , L/w} from 
(3.3), using various combinations of w and L. Unlike Assumption 3.3, here 
we use variable motifs, meaning that 0{ 7^ 1 for all k and m. We have also 
done experiments with other values of pj (pj = 0.003 and pj = 0.02), which 
gave qualitatively the same results. 

We choose o.q and ol\ so that the distribution of 6q and that of <?{* 
is symmetric in the four nucleotides; this means that we must have cxq, = 
ao x (1,1,1,1) and cti = aq x (1,1,1,1) for some ao,ai > 0. Since motifs 
are by definition fairly well conserved, we choose aq so that the median of 
m&x m£ {i i 41 J k * m is 0.95 (a± is found numerically). Since background data 
are typically more balanced among the four nucleotides, we choose ao so 
that the median of max^.^ i 4 i 6>q m is 0.3. 

For each simulated data vector S we run a systematic-scan Gibbs sampler 
five times from different initial values and use the Gelman-Rubin scale factor 
[Gelman and Rubin (1992)] to detect whether the chains have converged to 
different parts of the parameter space, corresponding to different local modes 
of the posterior density. Since the slow mixing in Theorem 3.2 is caused by 
multimodality of the posterior distribution, this approach should detect the 
problem effectively. If different runs of the Markov chain explore different 
parts of the parameter space, the Gelman-Rubin scale factor should be 
large (typically much larger than 2), while if they are drawing from the 
same distribution the scale factor should be close to 1. 

In order to detect the worst-case behavior, we take the initial vector 
A = (Ai, . . . , Al/ w ) for the first chain to be the vector of indicators of 
whether each subsequence S w i- w+ i :w i was generated from motif one. If appli- 
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cable we initialize the second chain at the vector of indicators of whether each 
subsequence S W i- w +i tw i was generated from motif two. The initial vector A 

in all other cases is generated randomly according to A4 Bernoulli (po). 
Although in practice one would not know the true motif locations, we do 
this to ensure that we detect even very narrow and hard-to-find modes cor- 
responding to the true motifs. We run each Gibbs sampler for a burn-in 
period of 1000 updates of the entire vector A, and then a sampling period 
of 10, 000 updates of A. With these choices standard convergence diagnostics 
[cf. Geweke (1992)] that evaluate the convergence of the chains individually 
do not detect a convergence problem. 

When we run the Gibbs sampler we specify the inference model motif 
frequency po as po — J2j=iPj (other choices are investigated below). We 
specify the prior hyperparameters as fik,m = 1 for k € {0, . . . ,w} and m € 
{1, . . . ,4}; this is the standard choice. 

Having simulated the chains, we calculate the Gelman-Rubin scale factor 
for the following parameter summaries: 

0A:,m(A) = — - , , ' k E {1, . . . ,w},m e {1, . . . ,4}, 



'0,m{ 



|N(AW)| + |/3, 
iV(A c ) m + /3 , ; 



|N(A<0| + |/3 o | 

as well as for |A|, recalling the notation (2.1) and (2.2). The values 9). m (A) 
and #o,m(A) are relevant because they are the posterior means of 6k, m and 
$o,m given A. The posterior density estimates of #2,1 (A) from two different 
Markov chains in the case J = 2 are shown in Figure 2. The Gelman-Rubin 
scale factor for these chains is 10.9, accurately reflecting the fact that the 
two chains have converged to different parts of the parameter space. 

The top display in Table 1 addresses the case of one true motif ( J = 1) 
and various combinations of w and L. For each combination it reports the 
percentage out of 20 simulated datasets for which the maximum Gelman- 
Rubin scale factor (over the different parameter summaries) is greater than 
1.5. The bottom display in Table 1 reports the same quantities for the case 
of two true motifs ( J = 2). For one motif no convergence problem is detected 
for any of the simulated datasets. For two motifs, regardless of the value of 
w, there is a severe convergence problem for large values of L. 

Finally, we investigate the effect of other choices for po- Specifying po = 
0.005, po = 0.002 or po = 0.02 yields results that are qualitatively the same 
as those in Table 1. 

5. Proof of Theorem 3.1. 



5.1. Specification of the Gibbs sampler. Here we give the details of the 
Gibbs sampler T. Recalling the notation of Section 2.1, the sampler iter- 
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Table 1 

For the cases of one true motif (top) and two true motifs (bottom), 
the percentage of simulated datasets for which the Gelman-Rubin 
scale factor from five Gibbs sampling chains is > 1.5 





iu = 6 


w = 10 


w = 15 


J = l 
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100 


L/w = 8000 
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atively updates each Ai according to its conditional posterior distribution, 
given as follows where Ar_ji refers to the vector A excluding the ith element, 
where Auq] is the vector A with the ith element replaced by 0, and where 
Arj^i is A with the ith element replaced by 1. Using (2.5), 

tt(A = i|a m ,s) 



tt(A = 0|A H] ,S) 

Po /rXNtA^,) + /3 )\ r(|N(Af. 0] )| + \(3 \ 



1 -poVr(N(Af. 0] ) +(3 )J r(|N(A^ 1] )| + |/3 | 



-w-\-k 



(5.1) x T 

tt x |N(Ag>)| + |/3 fe | 

" 1 -po \r(N(Af. 0] ) +/3 ) ) rWAjyi + |/3 |) 11 



fc=i 

i€{l,...L/«;}, 



where the elements of the vector Ok are the current estimates of the frequency 
of each nucleotide in position k of the motif, that is, 

(5.2) 9 Km ± ^ , k£{l,...,w},m£{l,...,M}. 

|N(A^)| + |/3 fc | 

For details see Liu, Neuwald and Lawrence (1995) and Jensen et al. (2004). 
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5.2. Outline of proof of Theorem 3.1. Informally, the proof of Theo- 
rem 3.1 proceeds by showing the following results. 

Step 1. The order of the spectral gap of the Gibbs sampler is determined 
by the unimodality or multimodality of the marginal posterior distribu- 
tion of a particular summary vector C(A) of A, denoted by 7f(C(A)|S). 
If 7f(C(A)|S) is multimodal the order of the spectral gap is determined by 
the heights of the modes relative to the heights of the valleys between the 
modes. 

Step 2. When E\ogf(s\0Q :w ) is multimodal, the marginal posterior dis- 
tribution tt(Oq :w \S) of the continuous parameters 6q- w is also multimodal, 
with height of the modes increasing exponentially in L, relative to the height 
of the valleys between the modes. 

Step 3. The result of step 2 can be mapped to 7f(C(A)|S), showing that 
the posterior distribution of C(A) has multiple modes with height that 
grows exponentially in L (relative to the valleys in between). 

For simplicity of notation we consider the case M = 2 (two nucleotides), 
although the proof is analogous for any fixed M. 

Formally, for s 6 {1,2}™ any length- w vector of nucleotides define 

(5.3) C(A) S 4 \{i :Ai = l, S wi - w+1:wi = s}\ 

to be the number of instances of motif s (where |{. . .}| indicates the cardi- 
nality of a set). Similarly, let 

(5.4) C r (S) B 4|{i:S t0i _ tt+lwi = s}| 

be the number of times that the sequence of nucleotides s occurs in the 
data. Then we must have C(A) S < C(S) S for each s, that is, C(A) lies in 
the space 

(5.5) X± H {0,...,C(S) B }. 

se{i,2} w 

The posterior distribution 7r(AjS) only depends on A through C(A), 
which can be seen as follows. Using (2.5), 7r(A|S) depends on A via the 
quantities |A|, N(A^'^) and N(A C ). These in turn only depend on C(A), 
since [using (2.1)-(2.2) and (5.3)] 

|A| = |C(A)|, 

(5.6) N(AM) m = ^C(A) s l K , =m} , k G {1, . . . ,w},m e {1,2}, 

s 

UI 

iV(A c ) m =iV(S) m -^iV(A( fc )) m . 

k=l 
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Fig. 3. The log-density log7r as a function of c^i) and C( 2 ,2), fixing cji^) = C(2,i) = 
and for the case w = 2. 



The marginal posterior distribution of C(A) is denoted by 

(5.7) 7f(c|S)= n ( A \ S )i ce *- 

A:C(A)=c 

Figure 3 illustrates multimodality of ff (c|S) for the case w = 2. For w = 2 the 
arguments to the function n are the quantities cnu, cm |2 ), c^i) and C( 2) 2)- 
The data S used to create Figure 3 were generated with two true motifs, 
yielding the two visible modes of tt. 

We will use Theorem B.l to bound Gap(T). Partition the state space X 
of T according to the value of C(A), 

(5.8) D c = {Ae^:C(A) = c}, ceX. 

Define the projection matrix (Appendix B) for T with respect to this parti- 
tion: 

(ci,c 2 j = ^ „/\\g\ 

(5.9) i 

A:C(A)=ci ' Cl ' 

since 7r(A|S) depends on A only via C(A), so that tt(A|S) is equal for all 
A E D Cl . The matrix T is reversible with respect to 7f (Appendix B). 

It is easier to obtain useful bounds on Gap(T) indirectly by relating T 
to T and bounding Gap(T) than it is to obtain useful bounds on Gap(T) 
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directly. The same technique is utilized in Madras and Zheng (2003) and 
Woodard, Schmidler and Huber (2009a). The cardinality of the state space 
X = {0,1} L / W of T grows exponentially in L for fixed w, while the cardinality 
of the state space X of T grows only polynomially in L, since [using (5.5)] 

(5.10) \X\ < (L/w + lf w . 

We obtain an upper bound on Gap(T) using conductance (Theorem B.2) 
and a lower bound on Gap(T) using path bounds (Theorem B.4). Bounds 
obtained using these tools can easily be inaccurate by a factor equal to 
the cardinality of the space. If we were to obtain these bounds directly for 
Gap(T) they would be loose by an exponential factor in L, and thus unusable 
for our purposes. 

5.3. Step 1 of proof of Theorem 3.1. Consider the graph associated with 
the reversible matrix T, with vertices corresponding to c G X and edges 
corresponding to pairs C\,C2 G X having T(ci,C2) > 0. For any Cx,C2 G X 
let r cijC2 denote the set of paths between ci and C2 in the graph that do 
not have repeated vertices. Also let c G 7 indicate that the state c G X is a 
vertex in the path 7. Then Theorem 5.1 formalizes step 1 from Section 5.2. 

Theorem 5.1. Gap(T) decreases exponentially in L if and only if 

A . . TffclSl 



(5.11) d= min _ max min _ 

ci,c 2 eA , 7Gr Cl ,c 2 C67 7r(ci|S)7r(c 2 |S) 

decreases exponentially in L. 

The quantity d measures the multimodality of tt. Roughly, think of ci,C2 
as being local modes of ff; if all paths between ci and C2 contain a state 
with low probability, then d is small. 

Proof of Theorem 5.1. The transition matrix T is nonnegative defi- 
nite and reversible with respect to 7r(A|S). Notice that T 2 is also reversible 
w.r.t. 7r(A|S). Using (5.8), let T 2 \d c be the restriction of T 2 to D c as defined 
in Appendix B. Then by Lemma B.l and Theorem B.l, 

Gap(T) > - Gap(T 3 ) = - Gap(r 1 / 2 r 2 T 1 / 2 ) 

(5.12) 3 3 

> -Gap(T) minGap(T 2 | Dc ). 

Combining with Proposition 5.1 below, we have that 

(5.13) Gap(T)- 1 = ©(GapCf)" 1 x L 2w+Z ). 

Theorem B.l also gives the bound Gap(T) < Gap(T). So Gap(T) is within a 
polynomial (in L) factor of Gap(T). Theorem 5.1 then follows from Propo- 
sition 5.2 below. □ 



CONVERGENCE RATE OF MARKOV CHAINS 



21 



Finally we give several results used in the proof of Theorem 5.1. 
Proposition 5.1. We have [min cg ^ Gap^lnJ] -1 = 0(L 2w+3 ). 

Proof. Take any c € X. For s e {1,2} W such that C(S) S > let X s = 
{z 6 {0, 1}°^ :Y.i z i = c s}- Using (5.4) and (5.8) the subvector of A € D c 
defined by (Ai : S w i- w+ \ :w i = s) takes values in the space X s . So there is 
some bijective map h such that 

(5.14) {h(A):A€D c }= ]J X s . 

se{l,2}«":C(S)s>0 

Define a transition matrix T having state space rise{i 2} w ■ c(S) s >o ^ and 
elements f (h(A), h(A')) = T 2 \ Dc (A, A') for all A, A' € D c . Then we have 

(5.15) Gap(T) = Gap(T 2 | Dc ). 

Using (B.2) and the fact that T 2 is reversible w.r.t. 7r(AjS), T 2 \d c is 
reversible w.r.t. 7r|£) c (A|S). Since 7r(A|S) is equal for all A € D c , 7r|^ c (A|S) 
is uniform on D c , and thus T is also reversible w.r.t. the uniform distribution. 

We will compare T to a product chain as defined in Theorem B.3, denoted 
by T* . Define T* to have component chains indexed by s € {1, 2} w : C(S) S > 
0. The component chains are denoted by T*, have state space X s and are 

combined using weights b s = -jjj^- to form T* . Define T* to be the exclusion 

process with c s particles on the complete graph of {1, . . . , C(S) S } [Diaconis 
and Saloff-Coste (1993)]. This chain is defined informally as follows, where 
z € X s is the current state. If c s = or c s = C(S) S , then X s has a single 
state, and the transition matrix T* is trivially defined. Otherwise, T* chooses 
an index j uniformly at random from {i G {1, . . . , C(S) S } : %i = 1}. Then it 
chooses an index £ 6 {1, . . . ,C(S) S } uniformly at random. If zi = 0, then 
Zj is changed to 0, and Z£ is changed to 1; otherwise, the state z does not 
change. The matrix T* is reversible with respect to the distribution fj, s that is 
uniform on X s [Diaconis and Saloff-Coste (1993)]. So by Theorem B.3, T* is 
reversible with respect to the uniform distribution on O s e{i 2} w ■ c(S) s >o 

By Theorem 3.1 of Diaconis and Saloff-Coste (1993), for c s > we have 
Gap(T s *) > 1/cg, while for c s = 0, Gap(T*) = 1. Then Theorem B.3 together 
with c s < C(S) S yields 

Gap(T*)= min 6 s Gap(T*) 

(5.16) 

> min 

se{i,2} w ■. C(S) s >o 

By (2.6) and the fact that C(A) = C(A') = c for A, A' G D c , T 2 (A, A') > 
if and only if 3s £ {1, 2} w : C(S) S > such that A' differs from A only by 
swapping two elements of the subvector ( A-i : S W i-w+l:wi = s )- Swapping two 
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elements of a vector in X s is precisely the move made by the transition 
matrix T*. So 

(5.17) T*(h(A),h(A')) >0 iff T 2 (A,A') >0 for A,A'eD c :A^ A'. 
Using (B.l), 

(5.18) T 2 | Dc (A,A') = T 2 (A,A') VA, A' 6 D c : A ^ A'. 
Also, by Lemma 5.1 below, 

(5.19) di=\ min T 2 (A, A')l ^ = 0{L 2w+2 ). 

La,A'gA':T 2 (A,A')>0 J 

Combining with (5.17) and (5.18), for any A, A' € D c such that A ^ A' and 
T*(h(A),h(A'))>0, 

T(/i(A), h(A')) = T 2 (A, A') > d^ 1 > d^T*(h(A), h(A')). 

For A, A' € D c such that A / A' and T*(h(A), h(A')) = the same inequal- 
ity holds trivially. So 

T(/i(A),/i(A')) >^ 1 T*(/i(A),/i(A')) V/i(A) ^ /i(A') e A" s . 

s6{l,2}»:C(S) s >0 

By Lemma B.2 and the fact that both T and T* are reversible with respect 
to the uniform distribution on n s e{i 2} w ■ c(S) s >o we then have Gap(T) > 
d ] ~ 1 Gap(r*). Combining with (5.15) and (5.16), 

Gap^ 2 !^)" 1 = Gap(T)- 1 < di Gap(T*)~ 1 

< diL 
~ w 

regardless of the value of c. By (5.19) d\ does not depend on c, so 
max ce ^Gap(T 2 | Dc )- 1 = 0(L 2 «'+ 3 ). □ 

Lemma 5.1 was used in the proof of Proposition 5.1. 

Lemma 5.1. We have 

min T(A,A')}~ 1 = 0(L W+1 ). 

A,A'eX:T(A,A')>0 - 1 

Proof. Recall the definition of 9k, m from (5.2). Using (5.1), 
7r(A = l|A[_ fl ,S) 



-w-\-k 

k=l 
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w 



>mm<! -, \p [[9 k , Swi _ w+k 
\ k=l , 



So 



/ [po II ks wi - w+k + (1 -po)(r(N(Af. 0] ) + /3 )r(|]Nr(Af. )1] )| + |/3 |)) 
(5-20) /(r(N(Af. 1] )+/3 )r(|N(Af. 0] )|+|/3 |))j 

k=l 

/(2(1 - Po )(r(N(Af. 0] ) +/3 )r(|N(Af. 1] )| + |/3 |)) 

/(r(N(Af. 1} ) +/3 )r(|N(Af i)0] )| + |/3 |))) 
Also, by (2.1) and the definitions of A[ i)0 ] and A^j, 

w 

N(A c m ) m = N(A c [hl] ) m + £ l {Sroi _ w+fe=m} , m € {1, 2}, 

fc=i 

|N(Af i)0] )| = |N(Af. 1] )| + U ;. 

r(N(A^ 0] ) + /3 )r(|N(A^ 1] )| + |/3 |) 
r(N(A^ 1] ) + /3 )r(|N(Af. 0] )| + |/3 |) 

= ( II ( N ( A [i,l])m + /8b,m)(JV(Af M] ) m + Po,m + 1) 
\m=l 

• ' • (^(Afi,!])^ + Po,m + E Ms^^m} ~ 1) 

k=l / 

/(m^])\+\0o\)(w^,i])\+\0o\+i) 

•■•(|N(Af a] )| + \/3 \ + w-l)) 

< ( H (|N(Af. + |/3 |)(|N(Af. ^1 + |/3 | + 1) 

\m=l 

• • • (V(A^)| + |/3 | + p l {Swi _ w+h = m} ~ lj ^ 

/((\N(Al hl] )\ + \(3 \)(\N(Af hl] )\ + \(3 \ + l) 
...(\N(A c [hl] )\ + \p \+w-l)) 

< 1. 
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Combining with (5.2) and (5.20) 



Po Uk=i $k,S, 



W a 



tt(A = 1|A m ,S)> 




PO TT Pk,S, 



2 



which does not depend on A or i. So [minA,i ir(Ai = l|Ar_£i,S)] 1 = 0(L W ). 
Analogously, [min A ,i ^(A = 0|A H] , S)]" 1 = 0{L W ). Using (2.6) then yields 
the desired result. □ 

Proposition 5.2 was used in the proof of Theorem 5.1. 

Proposition 5.2. Gap(T) is within a polynomial (in L) factor of d. 



Specifically, Gap(T) = 0{dxL 2W ) and Gap(T)" 1 = 0(d~ x x L «>+i+2 w+1 +2 w ). 



The bounds in Proposition 5.2 only rely on Lemma 5.1 and the fact (5.10) 
that \X\ grows polynomially in L, and could be improved by leveraging ad- 
ditional properties of T (at the cost of some technical complexity). However, 
Proposition 5.2 is sufficient for our purposes. 

Proof of Proposition 5.2. 

Upper bound. Suppress the dependence of vr(c|S) on S for simplicity of 
notation. Let ci,c 2 G X be a pair of states that achieve the minimum in 
definition (5.11) of d, so that d = max 7g r Cl C2 min cg7 -^j^| c ^ . For any 7 6 

rci,c 2 ) let c 7 — argmin cg7 # ( C ^( C2 ) (in the case of a tie choose the state 
earliest in the path). Defining the set E = {c 7 : 7 £ r ci ,c 2 }j we have 

(5.21) d = max _ . \ _ 7 / . . 

c 7 e£ 7r(ci)7r(c 2 ) 

The set E separates ci and c 2 in the sense that there is no path 7 £ r ci]C2 
that does not include some state in E. If ci € E, then there is some 7 € r ci]C2 
for which c 7 = ci, and so d > > 1- In this case Gap(T) < 2d(L/w + l) 2 * 

holds since Gap(T) < 2. 

Now consider the case ci ^ E. Let B be the set of states c € X that are 
not reachable from Ci without going through E, 

B = {c € X : V7 G r cijC there is some c' G 7 s.t. c' G E}. 

We have that c 2 G .B, and ci G -B c since ci ^ E. Also, the only states ceB 
for which T(c, B c ) > satisfy c G E, which can be seen as follows. Otherwise, 
3c G B \ E and C3 G B c for which T(c, C3) > 0. Since C3 G B c , there is a 
path 7 G r C i,c 3 that does not go through E. But since T(c3,c) > (T is 
reversible), there is also a path 7 G r C i,c that does not go through E, which 
is a contradiction. 



IS 
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Using these facts, (5.10) and (5.21), the conductance of B (Theorem B.2) 



tt{B)%{B c ) 

< EcgE^( C ) < Ec€E^( C ) 
~ 7t(B)Tt(B c ) - 7f(ci)7f(c 2 ) 

7f( C ) 



Tt(B)Tt(B c ) 



< \E\ max ■ 



E\d<\X\d<d(L/w + iy 



ceE 7r(ci)7r(c2) 

Using Theorem B.2, Gap(T) < 2$ t (B) < 2d(L/w+_l) 2W as claimed. 

Lower bound. Recall that the transition matrix T has state space X and 
is reversible with respect to 7f. Using (5.8)-(5.9), for c,c'e^ such that 
E s ! c s - c sl < 1> T(c,c') > and otherwise T(c,c') = 0. Also 

(5.22)T(c,c') > min T(A,D C >) > min max T(A,A') Vc,c'eX. 

Aefl c A6D c A'e£> c / 

If T(c, c') > 0, then for every A G D c there is some A' G D c / with T(A, A') > 
0. By Lemma 5.1 and (5.22), 

-l 



(5.23) 



min T(c,c') 

c,c'eX: T(c,c')>0 



0(L 



UI + l^ 



We will use Theorem B.4 to obtain a bound for Gap(T); let £ be the set of 
edges in the graph of T. For (z, v ) G £ and 7 a path in the graph, let (z, v) G 7 
indicate that the edge (z, u) is in the path 7 (as distinct from z G 7 which 
indicates that the vertex 2 is in 7). To apply Theorem B.4 we need to define 
a path j Xt y for every pair of states x,y G X. Suppressing the dependence 

of 7f on S, choose 7-^ to be any path that maximizes min^^g^ ^ . 

Letting |{. . .}| denote the cardinality of a set, the path constant p defined 
in Theorem B.4 satisfies 



1 



< 



< 



max — - 

(z,v)e£ ir{z)T(z,v) 

1 

mi ^-(z,v)e£ T(z,v) 
1 



7r ( x ) 7r (y) len (7x-,y) 



max —— Tr(x)ir(y) 



maxlen(7 x y ) 



min {ZiV)££ T(z,v) 



max max 



max \{>y x>y 3 (z,v)}\ 
{z,v)&£ 



7r(x)vr(y) 



1 



maxlen(7 



J'..y 



min {ZiV)e£ T(z,v) 



max \{y x , y 3 (z,v)}\ 

{z,v)&£ 
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-1 



mm mm 



tt(z) 



x,y€X {z,v)e-y x , y 7r(x)7r(y) 
1 



maxlen(7 x?/ ) 

x,y 



max \{j xy B (z,v)}\ 



m (z,v)ee T {z,v) l{z*v)e£ [ L IJ, ' V 



x 



mm max mm 



< 



x,y€X-reT x , y (z,v)e-y 7r(x)7r(y) 
1 



maxlen^™) 



mm 



(z,v)££T(z,v) l(z,v)€E 



max \{y x , y 3 (z,v)}\ 



mm max mm 



7r(z) 



x,yGXl&r x>y z£7 7r(x)7r(y) 



maxlen(7 



From (5.10) = 0(L 2W ), so the maximum length of paths is 

max x ^ len(7 Xi y) = C(L 2 ), and the total number of paths is no more than 
\X\ 2 = 0{L 2W+1 ). By Theorem B.4 and (5.23), Gap(f)- 1 < p = 



1 tw+1+2 w+1 +2 w ^ 



□ 



5.4. Step 2 of proof of Theorem 3.1. Recalling that M = 2, there are 
w + 1 free parameters 6k,\ G [0, 1] for k G {0, . . . , w}, so we write 

(5.24) 6 0:w e[0,l] w+1 . 
Theorem 5.2 formalizes step 2, using Theorem A.l. 

Theorem 5.2. Under Assumption 3.1, if E log f(s\6o :w ) is multimodal, 
then there exist e > and two sets B\,E>2 C [0, l\ w+1 separated by Euclidean 
distance e such that 

tt^q^^U^IS) 7r(g 0: tt,gfliUfl 2 |S) 
1 ' tt^G^IS) ^ vr^G^IS) 

decrease exponentially in L, almost surely. 

Proof. The inference model assumes S^j-^+i^j f{s\9o :w ) for i G 
{1, . . . ,L/w}. By Assumption 3.1, the generative model assumes 

S w i- w +i; W i g(s), fitting into the framework of Theorem A.l. Using the 
notation of that theorem, consider the case where t](6q :w ) = Elogf(s\6o- W ) is 

multimodal. Then there are { o j) w G [0, 1} W+1 and Fj C [0, 1} W+1 for j G {1, 2} 



such that 0^ G Fj and (3.1) holds. Let (f)j = 
j G {1, 2}, and take any £ for which 



(5.26) 



£ G I min 6~, min r?(#n 

Ve{i,2} 3 je{i,2} 



sup aF . ^(6*0:^) < ^o-i) for 



0) 



So £ > 0j for some j G {1,2}; assume WLOG that £ > 
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Define the sets 

v^{e 0:w e[o,i] w+1 : V (e : W )>c}, 

(5-27) 

B 1 = F 1 nV, B 2 = V\F 1 . 
By (5.26), rjiO^J > £ and 0^ € B x . Also using (3.1), 7/(0^) > £ and 

0S,GF 2 nUc5 2 , so 

(5.28) V = BiUB 2 and sup 7?(0 O :™) > £, j€{l,2}. 

If > such that B\ and B 2 are separated by distance e, then (since 
[0,l] w+1 is compact) cl(Bi) n cl(5 2 ) / 0. By (5.27) i?i C F Y and 5 2 C 
[0,l] tu+1 \ Fi so cl(-Bi) n cl(-B 2 ) C dF\. This is a contradiction since, due 
to (5.27)-(5.28) and the continuity of 77, 

inf 77 > inf 77 = inf 7/ > £ > 0i = sup 77. 

cl(Bi)ncl(B 2 ) cl(Bi)Ucl(B 2 ) BiUB 2 SFi 

So 3e > such that B± and -B 2 are separated by distance s. 

In order to satisfy assumption (2) of Theorem A.l we remove points from 
the space [0, l] w+1 of 6 .. w for which 3s G {1, 2} w : f(s\0 0:w ) = 0. This results 
in the space 

(5.29) A = ((0, 1) x [0, If) U ([0, 1] x (0, l) w ). 

The fact that f(s\6Q :w ) > for all s and all 0q- w G A is a consequence of 
(2.8): if #0,1 £ (0,1), then f{s\6o :w ) > for all s, and the same holds when 
m g(O,1) for all ke{l,...,w}. 
Taking 

(5.30) Bj = Bj DA, jG {1,2} 

B\ and B 2 are separated by distance e. Due to (3.1), 6q) w is not a limit 
point of [0, l] w+1 \ Fj for j G {1, 2}. Using (5.29) there are points 9 ., w G A 

arbitrarily close to Oq.^ G [0, From (5.26) and the continuity of rj, all 

such points 0q :w close enough to 0q£, have 6q :w G Fj n A and t](6q :w ) > £. 
Using (3.1), (5.27) and (5.30) these points are in Bj, so 

(5.31) SUp 7/(00:™) >£, jE {1,2}. 

Let Int(-) denote set interior with respect to the space A. We claim that 

(5.32) {0 0:w G A : 7/(00:™) > £} C Int(£i) U Int(S 2 ), 

which can be seen as follows. Take any 9q- w G A such that t]{0q- w ) > £. By 
(5.27), (5.28), (5.30) and since 77 is continuous, 0o : «, G Int(i?i U B 2 ). Because 
B\ and B 2 are separated by distance e > 0, 6q ]W G Int(-Bi) Ulnt(-B 2 ). 
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Define the alternative parameter spaces Ai = A \ Int(-B 2 ) and A 2 = A \ 

. SUp A T)—( 

Int(5i). By (5.31), sup A . r\ > £ for j € {1,2}. So Sj = — i > for j e 

{1,2}. Combining (5.32) with the fact that sup Aj 77 — Sj > £, 

U 3 Sj = {o . w G Aj:r)(0 0:w ) > sup rj - Sjj C Int(^) C Bj, 

(5.33) 

J'€{1,2}. 

By (5.33), 

(5.34) A \ (Si U B 2 ) C A \ (B 1 U Int(B 2 )) C A \ (E/£ U Int(fl 2 )) = Ai \ . 

Analogously, A\(BiUB 2 )cA 2 \ C/| 2 . 

The regularity conditions of Theorem A.l are verified in the supplemen- 
tary material [Woodard and Rosenthal (2013)] for each of the parameter 
spaces Ai and A 2 . We apply that theorem for each of j £ {1 ; 2}, with pa- 
rameter space Aj, using 5 = Sj and taking n = L/w. This yields 

f p n (A,\ul)y/n _ 5 



lim sup =— ^- <e j a.s. j € {1,2}. 

n^oA P n (U 3 g.) J ~ 

Combining with (5.33)-(5.34) and the fact that [0, l] w+1 \ A has probability 
zero under tt(6q ]W \S), for j G {1, 2} 

fir(0 0:w € [0, 1]- +1 \ (B 1 U B 2 )|S^ 1/{L/W) 

lim sup 



L->oo V ir(Qo; W € Bj\S) 

.. ^(e 0:m £A\(5 1 UB 2 )|S) V /(L/ ' c) 
= lim SUp 7-r —7- 

M9 0M €A j \U 3 s .\S)\U^ 
< lim sup I : — 

L^oo V T{(0O:w G C^. |S) / 

= lim sup : — — < e J 

n->oo V P„(^|.) J 

almost surely. □ 

5.5. Step 3 of proof of Theorem 3.1. Finally we formalize step 3. 

Theorem 5.3. If there exist e > and two sets B\,B 2 C [0, l] w+1 sep- 
arated by Euclidean distance e such that the ratios in (5.25) decrease expo- 
nentially in L, then the quantity d in (5.11) decreases exponentially in L. 

Theorem 5.3 is proven in the supplementary material [Woodard and Rosen- 
thal (2013)]. Theorems 5.1-5.3 together imply Theorem 3.1. 
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6. Conclusions. The Gibbs sampling method is a popular approach to 
finding gene regulatory binding motifs, but its poor convergence in practice 
means that it can only be used to generate candidate motifs that must be 
ranked using a secondary criterion. If one could efficiently obtain samples 
from the posterior distribution, these samples could be used to directly find 
the "best," that is, most probable, motifs, obviating the need for secondary 
analysis. We have obtained theoretical and empirical results showing that 
the convergence of the Gibbs sampler is even worse than previously realized. 
Our results reinforce the need to convey the limitations of any estimates 
obtained using the Gibbs sampler, and the need to develop more efficient 
Markov chain methods for motif discovery. 

Although our main result (Theorem 3.1) is phrased in terms of a specific 
model, the methods used to prove this result are very widely applicable 
to situations with i.i.d. data, where the data are not necessarily generated 
according to the model, and where the function E log f(X\9) is multimodal. 
The extent to which slow mixing holds in other contexts will be determined 
by how generally this multimodality condition holds, so we are currently 
investigating this condition in detail. 

APPENDIX A: BAYESIAN ASYMPTOTICS 

We quote a result from Berk (1966) on Bayesian asymptotics for i.i.d. ob- 
servations. Let f(x\8) be the density (with respect to some cr-finite measure 
on a space y) of each observation X{ under the inference model, parame- 
terized by 9 € A where A is a Borel subset of a complete separable metric 
space. Let the true distribution of the observations be denoted by G. De- 
fine: 

(1) The "carrier" of a distribution P: the smallest relatively closed set 
having probability one under P. 

(2) P n : the posterior distribution of 9 with n observations, with respect 
to a prior P having carrier A. 

(3) rj(8) = E log f(X\9) where the expectation is taken with respect to 
X ~ G. 

(4) 7]* =snp{7](9):9 &A}. 

(5) Us = {8 € A : 77(0) > rf - 5} for S > 0. 

Assume that: 

(1) f(x\9) is measurable jointly in x and 9; for G-almost every x, f(x\9) 
is continuous in 9. 

(2) For all 9 € A, G{x : f(x\9) > 0} = 1. 

(3) For any compact FcA, EsvLo e&F |log/(X|0)| < oo. 

(4) 7](9) is continuous. 



30 



D. B. WOODARD AND J. S. ROSENTHAL 



(5) For any real number r there is a co-compact set D C A {D c = A \ D 
is compact) and a cover D\, . . . , Dk of D such that 

(A.l) £suplog/(X|0)<r, k€{l,...,K}. 

eeD k 

With these assumptions, we have Theorem A.l. 

Theorem A.l [Berk (1966)]. For G-almost every sequence of observa- 
tions {xi : i € N} and any 5 > 0, 

Theorem A.l is a sub-result given in the proof of Berk's main theorem. 
It says that the posterior probability of Ug decreases exponentially in n. 
Here we have stated the result slightly more generally than Berk (1966) in 
the sense that we replace his assumption (iii) with the only two relevant 
consequences of that assumption: our assumptions (3)— (4). Also, in our as- 
sumption (5) we have allowed a cover of D, whereas Berk's assumption (iv) 
takes K = 1 . The extension to the case of general K is immediate from his 
proof. 

APPENDIX B: TOOLS FOR BOUNDING SPECTRAL GAPS 

Let P and Q be transition kernels that are reversible with respect to dis- 
tributions /Up and fiQ on a (general) state space X with countably-generated 
ex-algebra. Let P\b for B C X indicate the restriction of P to B, which is 
defined to have state space B and transition probabilities identical to P 
except that any move to B c is rejected, 

(B.l) P\ B (x,D) = P(x,D) + l {xeD} P(x,B c ), x£B,DcB. 

Also let /xp|_e be the restriction of fip to B, that is, 

(B.2) np\ B (dx) = fip(dx)/iip^B), x 6 B. 

Then P\b is reversible w.r.t. ^p\p- 

For a partition {Bj}j =1 of X, let P be the projection matrix of P with 
respect to {Bj} J - =l , defined to have state space {1,..., J} and i,j element 
equal to the probability that P transitions to Bj, given that the current 
state is in B{. That is, 

P(i,j) = j (iplB^P&Bj), i,je{l,...,J}. 
The matrix P is reversible w.r.t. /2, where £t(j) = fip(Bj). 

Lemma B.l [Madras and Zheng (2003)]. For any N G N we have 
Gap(P) > ^Gap(P JV ). 
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Although Madras and Zheng (2003) state this result for finite state spaces, 
their proof also holds for general state spaces. 

Theorem B.l [Madras and Randall (2002)]. Let hp = (j,q, and let {Bj}j =l 

be any partition of X . Assume that P is nonnegative definite and let P 1 / 2 
be its nonnegative square root. Then 

Gap(P 1 / 2 QP 1 / 2 ) > Gap(P) minGap(Q|p), 

3 

Gap(P) <Gap(P), 
where P is the projection matrix of P with respect to {Bj}j =1 . 

Theorem B.2 [E.g., Sinclair (1992)]. For X finite define 

a , , , s a yZ^cn fJ>p(x)P(x, B c ) 
$ P = min $p(B), <&p(B) = ,l} ,„ ' -. 

Here $p is called the "conductance," and &p(B) is referred to as the con- 
ductance of the set B. Then Gap(P) < 2<3?p. 

Theorem B.3 [Diaconis and Saloff-Coste (1996)]. Take any N € N, and 
let Pk, k = 0, ...,N, be [i^-reversible transition kernels on state spaces X k . 
Let P be the transition kernel with state x = (xq, • • • , x n) in the space X = 
H k X k , given by 

N 

P(x,dy) = ^2b k P k (x k ,dy k )5 X[ _ k] {y [ _ k] )dy [ _ k] , x,y € X 

k=0 

for some set ofb k > such that ^2 k b k = l, where 5 is Dirac's delta function, 
and where Xr^i indicates the vector x excluding x k . P is called a product 
chain with "component" chains P k . It is reversible with respect to fip(dx) = 
Y[ k H k {dx k ), and 

Gap(P) = min^ftfc Gap(Pfc). 

Lemma 3.2 of Diaconis and Saloff-Coste (1996) states Theorem B.3 for 
finite state spaces; however, the proof holds in the general case. 

Lemma B.2. Take finite X and hp = [iQ- If3b>0 such that bQ(x, y) < 
P(x, y) for every x, y G X such that x ^ y , then b Gap(Q) < Gap(P) . 

Proof. The proof is nearly identical to that of Lemma 5.1 in Woodard, 
Schmidler and Huber (2009a). □ 

Lemma B.2 is closely related to Peskun ordering results; cf. Peskun (1973), 
Tierney (1998), Mira (2001). 
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Theorem B.4 [Sinclair (1992), Diaconis and Stroock (1991)]. For X 

finite, define a simple path j Xi y between every ordered pair x,y G X in the 
graph of the Markov chain with transition matrix P. A simple path is a 
sequence of connected edges with no repeated vertices. Define the quantity 



where 8 is the set of edges, where j X)V 3 (z,v) is a path using the edge (z,v), 
and where len(7 X)3/ ) is the number of edges in "f x ,y Then Gap(P) > p~ x . 
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